DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning
Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022
We participated in the third challenge for the Audio-Visual Scene-Aware Dialog (AVSD) task in DSTC10. The task was updated with two changes: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. A baseline system built using an AV-transformer was released along with the new DSTC10-AVSD dataset, which includes temporal reasoning. This paper introduces a new system that extends the baseline with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network (RPN). We confirmed that our system outperformed both the baseline system and the previous state of the art on the AVSD test sets for DSTC7, DSTC8, and DSTC10. Furthermore, temporal reasoning using the RPN outperformed the attention-based method of the baseline system.
Citation: @inproceedings{shah2022dstc10, title={DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning}, author={Shah, Ankit Parag and Hori, Takaaki and Le Roux, Jonathan and Hori, Chiori}, booktitle={Proceedings of DSTC10 Workshop at AAAI-2022}, year={2022}}